feat(rkllm): per-core NPU pinning and Ollama-compat API by jaylfc · Pull Request #4 · freed-dev-llc/exo-rkllama

jaylfc · 2026-04-12T10:47:56Z

Summary

Add RKNN_CORE_MASK env var to pin the RKLLM engine to specific RK3588 NPU cores (0x1, 0x2, 0x4 for individual cores, 0x7 for all). Useful for multi-tenant workloads on a single board.
Add Ollama-compatible API mode to the HTTP client, tested live against NotPunchnox/rkllama. Uses /api/tags for model listing and /api/generate for inference. This is the default mode since most RK3588 deployments run rkllama.
Add get_capability_descriptor() to expose NPU core count and estimated TOPS to the topology manager.
Add scripts/test-3node-lxc.sh test harness for multi-node testing on a single RK3588 via Incus containers.
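
As a rough illustration of the RKNN_CORE_MASK values listed above, the sketch below parses the variable and reports which cores a mask selects. The helper names (`parse_core_mask`, `cores_in_mask`) and the fallback to 0x7 when the variable is unset are assumptions made for this example, not code taken from the PR.

```python
import os

RK3588_NPU_CORES = 3                          # RK3588 has 3 NPU cores (per the PR notes)
ALL_CORES_MASK = (1 << RK3588_NPU_CORES) - 1  # 0x7 selects all three cores


def parse_core_mask(env=None):
    """Read RKNN_CORE_MASK from a mapping (defaults to os.environ).

    Falling back to 0x7 when the variable is unset is an assumption
    of this sketch, not documented behaviour of the PR.
    """
    env = os.environ if env is None else env
    raw = env.get("RKNN_CORE_MASK", hex(ALL_CORES_MASK))
    mask = int(raw, 0)  # base 0 accepts "0x4" as well as "4"
    if not 0 < mask <= ALL_CORES_MASK:
        raise ValueError(f"RKNN_CORE_MASK out of range: {raw!r}")
    return mask


def cores_in_mask(mask):
    """Return the indices of the NPU cores selected by the bitmask."""
    return [i for i in range(RK3588_NPU_CORES) if mask & (1 << i)]
```

For example, `cores_in_mask(parse_core_mask({"RKNN_CORE_MASK": "0x4"}))` yields `[2]`, matching the "0x4 for core 2" mapping in the summary.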

Testing

Tested on Orange Pi 5 Plus (RK3588, 16GB RAM, librknnrt 2.3.0):

Health check, model listing, and text generation verified against live rkllama instance
NPU load confirmed via /sys/kernel/debug/rknpu/load showing 3 cores
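
The NPU load check can be scripted. The sketch below assumes the debugfs file contains a single line of the form `NPU load:  Core0: 42%, Core1:  0%, Core2:  7%,`, which is the layout commonly seen with the rknpu driver but is not confirmed by this PR; the function names are hypothetical, and reading debugfs normally requires root.

```python
import re
from pathlib import Path

LOAD_PATH = Path("/sys/kernel/debug/rknpu/load")  # path quoted in the PR
_CORE_RE = re.compile(r"Core(\d+):\s*(\d+)%")      # assumed line format


def parse_npu_load(text):
    """Map core index -> load percentage from the text of the load file."""
    return {int(core): int(pct) for core, pct in _CORE_RE.findall(text)}


def read_npu_load(path=LOAD_PATH):
    """Read and parse the live load file (usually needs root)."""
    return parse_npu_load(path.read_text())
```

A result with three keys corresponds to the "showing 3 cores" observation above.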

Notes

Pipeline-parallel layer sharding is not possible with the current RKLLM SDK because .rkllm models are compiled monoliths. Feature request filed at airockchip/rknn-llm#489. The core_mask feature is for multi-tenant/multi-model workloads, not model sharding.

RK3588 has 3 NPU cores at ~2 TOPS each. This change lets you pin an exo node to specific cores so multiple nodes can run on the same board without contention. Set RKNN_CORE_MASK=0x1 for core 0, 0x2 for core 1, 0x4 for core 2, or 0x7 for all three. The core_mask is passed through the HTTP client to rkllama's load_model endpoint, and the engine exposes a capability descriptor with the core count and estimated TOPS so the topology manager can weigh nodes by actual capacity. Also adds a 3-node LXC test harness (scripts/test-3node-lxc.sh) that creates 3 Incus containers on one RK3588, each pinned to a single NPU core, to test distributed pipeline parallelism on consumer ARM hardware.
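
To make the capability descriptor concrete, here is a minimal sketch of how a core count and estimated TOPS could be derived from the mask, using the ~2 TOPS per core figure quoted above. The dictionary keys and the function signature are assumptions for illustration; the actual `get_capability_descriptor()` in the PR may return a different shape.

```python
TOPS_PER_CORE = 2.0   # approximate per-core figure from the commit message
ALL_CORES_MASK = 0x7  # three NPU cores on RK3588


def get_capability_descriptor(core_mask=ALL_CORES_MASK):
    """Hypothetical descriptor: pinned core count and estimated TOPS."""
    cores = bin(core_mask & ALL_CORES_MASK).count("1")  # count set bits
    return {
        "npu_cores": cores,                       # assumed key name
        "estimated_tops": cores * TOPS_PER_CORE,  # assumed key name
    }
```

Under these assumptions a node pinned with 0x1 would advertise one core and about 2 TOPS, while 0x7 would advertise three cores and about 6 TOPS, which is the kind of difference a topology manager could use to weigh nodes.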

Tested live against rkllama on RK3588 Orange Pi 5 Plus. The Ollama compat mode uses /api/tags for model listing and /api/generate for inference, matching the NotPunchnox/rkllama server API. This is the default mode since most RK3588 deployments run rkllama. Verified: health check, model listing, and text generation all work against a live rkllama instance with qmd-query-expansion model.
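
The two endpoints named above can be exercised with a small standard-library client. This is a sketch, not the PR's HTTP client: the base URL and port, the helper names, and the assumption that rkllama mirrors Ollama's request and response shapes (`{"model", "prompt", "stream"}` for generate, `{"models": [{"name": ...}]}` for tags) are all illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed host and port; adjust to your server


def tags_request(base_url=BASE_URL):
    """Build a GET request for /api/tags (model listing)."""
    return urllib.request.Request(f"{base_url}/api/tags", method="GET")


def generate_request(model, prompt, base_url=BASE_URL):
    """Build a non-streaming POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def model_names(tags_payload):
    """Extract model names from an Ollama-style /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def send(req, timeout=30.0):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Against a running server, `model_names(send(tags_request()))` would list the available models, and `send(generate_request("qmd-query-expansion", "hello"))` would request a completion from the model mentioned above.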

jaylfc added 3 commits April 12, 2026 11:17

docs(rkllm): clarify that pipeline parallelism needs upstream SDK support (airockchip/rknn-llm#489)

ba2b355

jaylfc wants to merge 3 commits into freed-dev-llc:main from jaylfc:feat/rknn-core-mask
